Using Processor Partitioning to Evaluate the Performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems

Authors

  • Xingfu Wu
  • Valerie Taylor
Abstract

Chip multiprocessors (CMPs) are widely used for high-performance computing. They offer significant new opportunities, such as high on-chip inter-core bandwidth and low inter-core latency, but they also introduce new challenges in the form of inter-core resource conflict and contention. An open question is how well current parallel programming paradigms, such as MPI, OpenMP and hybrid MPI/OpenMP, exploit the potential of CMP clusters for scientific applications. In this paper, we use the term processor partitioning to mean the number of cores per node used for application execution, and we use it to analyze and compare the performance of MPI, OpenMP and hybrid parallel applications on two Cray XT4 systems: Jaguar, with quad-core processors, at Oak Ridge National Laboratory (ORNL), and Franklin, with dual-core processors, at the DOE National Energy Research Scientific Computing Center (NERSC). We conduct detailed performance experiments to identify the major application characteristics that affect processor partitioning. The experimental results indicate that processor partitioning can have a significant impact on the performance of a parallel scientific application, depending on its communication and memory requirements. We also use the STREAM memory benchmarks and Intel's MPI benchmarks to explore the performance impact of different application characteristics, and we then use those results to explain the processor partitioning results obtained with the NAS Parallel Benchmarks. In addition to these benchmarks, we use a flagship SciDAC fusion microturbulence code (hybrid MPI/OpenMP), the 3D particle-in-cell Gyrokinetic Toroidal Code (GTC) for magnetic fusion, to analyze and compare the performance of MPI and hybrid programs on the dual- and quad-core Cray XT4 systems and to study their scalability on up to 8192 cores. Based on the performance of GTC on up to 8192 cores, we use the Prophesy system to generate performance models online and predict GTC's performance on more than 10,000 cores on the two Cray XT4 systems.
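As a rough illustration of what processor partitioning means in the abstract above, the sketch below lists the ways a fixed total core count can be spread over nodes when only some of each node's cores are used. The function name `partition_schemes` and the 64-core total are assumptions made for this example; they are not taken from the paper, and the paper's own partitioning schemes may differ.

```python
def partition_schemes(total_cores, cores_per_node):
    """Return (cores_used_per_node, nodes_needed) pairs for a fixed total core count.

    Hypothetical helper: only schemes that divide the total evenly are listed.
    """
    schemes = []
    for used in range(1, cores_per_node + 1):
        if total_cores % used == 0:
            schemes.append((used, total_cores // used))
    return schemes


# Illustrative run: 64 cores on a dual-core and a quad-core node type.
for name, cores_per_node in (("dual-core node", 2), ("quad-core node", 4)):
    for used, nodes in partition_schemes(64, cores_per_node):
        print(f"{name}: {used} core(s) per node on {nodes} nodes")
```

Using fewer cores per node spreads the same job over more nodes, which changes how much memory bandwidth and how many on-node communication partners each process has; that trade-off is what the experiments described in the abstract examine.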

Similar resources

Parallel Finite Element Earthquake Rupture Simulations on Quad- and Hex-core Cray XT Systems

In this paper, we integrate a 3D mesh generator into the simulation, and use MPI to parallelize the 3D mesh generator, illustrate an element-based partitioning scheme for explicit finite element methods, and based on the partitioning scheme and what we learned from our previous work, we implement our hybrid MPI/OpenMP finite element earthquake simulation code in order to not only achieve multip...

Parallel Multigrid Solvers Using OpenMP/MPI Hybrid Programming Models on Multi-Core/Multi-Socket Clusters

OpenMP/MPI hybrid parallel programming models were implemented in a 3D finite-volume-based simulation code for groundwater flow problems through heterogeneous porous media using parallel iterative solvers with multigrid preconditioning. Performance and robustness of the developed code have been evaluated on the “T2K Open Supercomputer (Tokyo)” and “Cray-XT4” using up to 1,024 cores through both of...

Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4

We apply auto-tuning to a hybrid MPI-pthreads lattice Boltzmann computation running on the Cray XT4 at National Energy Research Scientific Computing Center (NERSC). Previous work showed that multicore-specific auto-tuning can improve the performance of lattice Boltzmann magnetohydrodynamics (LBMHD) by a factor of 4× when running on dual- and quad-core Opteron dual-socket SMPs. We extend these stud...

Impact of Quad-Core Cray XT4 System and Software Stack on Scientific Computation

An upgrade from dual-core to quad-core AMD processor on the Cray XT system at the Oak Ridge National Laboratory (ORNL) Leadership Computing Facility (LCF) has resulted in significant changes in the hardware and software stack, including a deeper memory hierarchy, SIMD instructions and a multi-core aware MPI library. In this paper, we evaluate impact of a subset of these key changes on large-sca...

Performance Characteristics of Hybrid MPI/OpenMP Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore Clusters

The NAS Parallel Benchmarks (NPB) are well-known applications with fixed algorithms for evaluating parallel systems and tools. Multicore clusters provide a natural programming paradigm for hybrid programs, whereby OpenMP can be used for data sharing among the cores that comprise a node and MPI can be used for communication between nodes. In this paper, we use SP and BT benchma...

Publication date: 2009